options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("pacman")Take-home Exercise 1: Demographic structures and distribution of Singapore in 2024
1 Overview
A local online media company that publishes daily content on digital platforms is planning to release an article on demographic structures and distribution of Singapore in 2024.
2 Objective
Assuming the role of the graphical editor of the media company, you are tasked to prepare at most three data visualisations for the article.
3 Analytical Toolkit: RStudio
RStudio and Quarto are used as the primary analytical toolkit for this project. The data is processed using appropriate tidyverse family of packages and the data visualisation prepared using ggplot2 and its extensions.
Before we get started, it is important for us to ensure that the required R packages have been installed.
If you have yet to install pacman, install itby typing below in the Console:
We then load the following R packages using the pacman::p_load() function:
- tidyverse, a family of modern R packages specially designed to support data science, analysis and communication task including creating static statistical graphs.
- patchwork for combining multiple ggplot2 graphs into one figure.
- plotly, R library for plotting interactive statistical graphs.
- ggrepel: a R package provides geoms for ggplot2 to repel overlapping text labels.
- ggthemes: a R package provides some extra themes, geoms, and scales for ggplot.
- hrbrthemes: a R package provides typography-centric themes and theme components for ggplot2.
- qreport: Provides statistical components, tables, and graphs. - ggiraph: for making βggplotβ graphics interactive.
pacman::p_load(tidyverse, patchwork,
plotly, ggrepel,
ggthemes, hrbrthemes, ggiraph, DT, qreport)4 Data
Singapore Residents by Planning Area / Subzone, Single Year of Age and Sex, June 2024 dataset shared by Department of Statistics, Singapore(DOS)
4.1 Load the Data
First we load the data.
demographic_data <- read_csv("data/respopagesexfa2024.csv")
4.2 Check the Data
From the first glance, we notice that there are β0β Pop in the dataset. For this exercise, we will be focusing on the top-level visualization per Planning Areas and granular details like Population per Subzone and Floor Area arenβt necessary. We should exclude those rows with zero population values at the Planning Area level. This will help clean up the data and make the visualizations clearer by removing unnecessary zeros. We will do it in Chapter 4.3
Here I am using qreportpackageβs dataOverview that I already pre-load earlier.
dataOverview(demographic_data, plot = c("none"),)demographic_data has 75696 observations (75696 complete) and 7 variables (7 complete)
|Variable |Type | Distinct| Info| Symmetry| NAs|Rarest Value | Frequency of Rarest Value|Mode | Frequency of Mode|
|:--------|:----------|--------:|-----:|--------:|---:|:-----------------------|-------------------------:|:-----------|-----------------:|
|PA |Nonnumeric | 55| 0.999| 0.991| 0|Central Water Catchment | 228|Bukit Merah | 3876|
|SZ |Nonnumeric | 332| 1.000| 1.000| 0|Admiralty | 228|Admiralty | 228|
|AG |Discrete | 19| 0.997| 1.000| 0|0_to_4 | 3984|0_to_4 | 3984|
|Sex |Discrete | 2| 0.750| 1.000| 0|Females | 37848|Females | 37848|
|FA |Discrete | 6| 0.972| 1.000| 0|<= 60 | 12616|<= 60 | 12616|
|Pop |Continuous | 183| 0.831| 6.953| 0|1260 | 1|0 | 41742|
|Time |Discrete | 1| 0.000| 1.000| 0|2024 | 75696|2024 | 75696|
Letβs also count whatβs the total Pop
cntpop <- demographic_data %>%
summarise(Pop = sum(Pop, na.rm = TRUE))
cat(cntpop$Pop)4187720
Observation
The data shows Singapore Residents by Planning Area / Subzone, Single Year of Age and Sex as of June 2024 with total population of 4,187,720.
We observe that there are 75,696 rows and 7 columns. No missing values are observed. Refer to the column legend in Appendix A
There are a total of seven attributes. 5 of them are categorical data type and the other three are in numerical data type.
- The categorical attributes are: PA, SZ, AG, Sex, FA.
- The numerical attributes are: Pop, Time.
We can also observe how many distinct values for each Variable. This will help us think what to use for our visualization.
4.3 Data Preparation
4.3.1 Cleaning Data
As mentioned earlier, we will be focusing on the top-level visualization per Planning Areas and granular details like Population per Subzone and Floor Area arenβt necessary. Here we will exclude those rows with zero population values at the Planning Area level by using filter.
Clean using filter and display as data table
demographic_data_clean <- demographic_data %>%
filter(Pop > 0)
DT::datatable(demographic_data_clean , options = list(
columnDefs = list(list(className = 'dt-center', targets = 5)),
pageLength = 5,
lengthMenu = c(5, 10, 15, 20)))4.3.2 Age Group recoding
We will be recoding the Age Group to three groups for clearer visualization: Below 20 years, 20-64 Years, 65 years & Over following National Statistical Standards Recommendations on Definition and Classification of Age (More details in Appendix B ).
demographic_data_recode <- demographic_data_clean %>%
mutate(
AG_recode = recode(
AG,
"0_to_4" = "Below 20 Years",
"5_to_9" = "Below 20 Years",
"10_to_14" = "Below 20 Years",
"15_to_19" = "Below 20 Years",
"20_to_24" = "20-64 Years",
"25_to_29" = "20-64 Years",
"30_to_34" = "20-64 Years",
"35_to_39" = "20-64 Years",
"40_to_44" = "20-64 Years",
"45_to_49" = "20-64 Years",
"50_to_54" = "20-64 Years",
"55_to_59" = "20-64 Years",
"60_to_64" = "20-64 Years",
"65_to_69" = "65 Years and Above",
"70_to_74" = "65 Years and Above",
"75_to_79" = "65 Years and Above",
"80_to_84" = "65 Years and Above",
"85_to_89" = "65 Years and Above",
"90_and_over" = "65 Years and Above"
)
)
DT::datatable(demographic_data_recode , options = list(
columnDefs = list(list(className = 'dt-center', targets = 5)),
pageLength = 5,
lengthMenu = c(5, 10, 15, 20)))5 Data Visualisation, Observation, and Insights
5.1 Top 10 Planning Areas (PA) Ranked by Size of Resident Population (Pop)

top10PA <- demographic_data_clean %>%
group_by(PA) %>%
summarise(Pop = sum(Pop, na.rm = TRUE)) %>%
slice_max(order_by = Pop, n = 10)
top10plot <- ggplot(data = top10PA,
aes(y = reorder(PA, Pop/1000), x = Pop/1000)) + # reorder PA by Pop
geom_col(show.legend = FALSE, fill = "pink4") +
geom_text(aes(label = (Pop/1000)),
hjust = -0.2, color = "black", size = 3) +
ggtitle("Top 10 Planning Areas in 2024\nRanked by Size of Singapore Resident Population",
subtitle = paste("Total resident population:",
format(round(cntpop$Pop / 1000, 2), big.mark = ","),
"thousand")) +
labs(
y = NULL,
x = "Resident Population\nin thousands (β000)",
caption = "Source: singstat.gov.sg"
) +
theme_ipsum(base_family = "Arial",
plot_title_size = 14,
subtitle_size = 10,
caption_size = 8,
plot_title_face = "bold",
caption_face = "italic",
grid = "",
axis_title_face = "bold",
axis_title_size = 11) +
theme(axis.text.x = element_blank(),
axis.text.y = element_text(size=11, face="bold"),
axis.title.x = element_text(hjust = 0.5)
)+
scale_x_continuous(expand = expansion(mult = c(0, 0.1)))
top10plot# Total population of the top 10 Planning Area
cntpoptop10 <- top10PA %>%
summarise(Pop = sum(Pop, na.rm = TRUE))
cat(cntpoptop10$Pop)2358550
# The percentage of the population of the top 10 most populous Planning Area relative to the total population.
perc_pop_top10 <- (cntpoptop10 / cntpop) * 100
cat(perc_pop_top10$Pop)56.32062
π Insights Plot 1
Slightly over half (56.3%) of the 4,187.72 thousand (~4.19 million) residents in Singapore stayed in the top 10 planning areas of residence.
There were five planning areas with more than 250,000 residents each, namely Tampines, Bedok, Sengkang, Jurong West, and Woodlands.
Tampines was the most populated with 284,720 residents.
5.3 Age Group and Sex distribution

# Define age group levels and labels
AG_levels <- c(
"0_to_4", "5_to_9", "10_to_14", "15_to_19", "20_to_24",
"25_to_29", "30_to_34", "35_to_39", "40_to_44", "45_to_49",
"50_to_54", "55_to_59", "60_to_64", "65_to_69", "70_to_74",
"75_to_79", "80_to_84", "85_to_89", "90_and_over"
)
AG_labels <- c(
"0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35-39",
"40-44", "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79",
"80-84", "85-89", ">90"
)
AG_lookup <- setNames(AG_labels, AG_levels)
# Compute population shares by age group
AGsum2 <- demographic_data_clean %>%
group_by(AG) %>%
summarise(Pop = sum(Pop, na.rm = TRUE), .groups = "drop") %>%
mutate(
AG = factor(AG, levels = AG_levels),
Pop_share = Pop / sum(Pop),
cum_share = cumsum(Pop_share),
percentile = round(cum_share * 100, 1)
) %>%
arrange(AG)
# Median and Q3 age groups
median_AG <- AGsum2 %>% filter(cumsum(Pop) >= sum(Pop) / 2) %>% slice(1) %>% pull(AG)
q3_AG <- AGsum2 %>% filter(cumsum(Pop) >= sum(Pop) * 0.75) %>% slice(1) %>% pull(AG)
# Convert median and Q3 to labels
median_label <- AG_lookup[as.character(median_AG)]
q3_label <- AG_lookup[as.character(q3_AG)]
# Prepare population pyramid data
pyramid_data <- demographic_data_clean %>%
group_by(AG, Sex) %>%
summarise(Population = sum(Pop, na.rm = TRUE) / 1000, .groups = "drop") %>%
mutate(
AG = factor(AG, levels = AG_levels, labels = AG_labels, ordered = TRUE),
PopPercentage = ifelse(Sex == "Females", -Population, Population) / sum(Population) * 100,
PopPercentage = round(PopPercentage, 2),
Signal = ifelse(Sex == "Females", -1, 1)
)
# Create the population pyramid plot
pyramid_plot <- ggplot(pyramid_data, aes(x = AG, y = PopPercentage, fill = Sex)) +
geom_bar(stat = "identity") +
geom_text(aes(y = PopPercentage + Signal * 0.5, label = abs(PopPercentage)),
size = 3, color = "black") +
coord_flip() +
scale_fill_manual(values = c("Females" = "pink2", "Males" = "steelblue"),
guide = guide_legend(override.aes = list(fill = NA))) +
scale_y_continuous(labels = abs) +
annotate("segment", x = median_label, xend = median_label, y = -4, yend = 3.6,
color = "red4", linewidth = 0.7, linetype = "dotted") +
annotate("text", x = median_label, y = 5, label = "Median",
color = "red4", size = 2.8, fontface = "bold") +
annotate("segment", x = q3_label, xend = q3_label, y = -3.65, yend = 3.65,
color = "red4", linewidth = 0.7, linetype = "dotted") +
annotate("text", x = q3_label, y = 5, label = "Q3",
color = "red4", size = 2.8, fontface = "bold") +
ggtitle("Population Pyramid of Singapore Residents\nby Age and Sex 2024",
subtitle = "in percentage (%)") +
labs(
y = "Population (%)",
x = "Age Group",
fill = "Sex",
caption = "Source: singstat.gov.sg"
) +
theme_ipsum(base_family = "Arial",
plot_title_size = 14,
subtitle_size = 10,
caption_size = 8,
plot_title_face = "bold",
caption_face = "italic",
grid = "Y",
axis_title_face = "bold",
axis_title_size = 11,
axis_text_size = 8) +
theme(
strip.text = element_text(face = "bold"),
axis.title.x = element_text(hjust = 0.5),
axis.title.y = element_text(hjust = 0.5),
legend.position = "top",
legend.title = element_blank(),
legend.justification = c(0.45, 0),
legend.margin = margin(t = -20, r = 0, b = -10, l = 0, unit = "pt")
)
pyramid_plotAlternatively for more simpler visualisation we can also use this plot, though we can only infer Age Distribution Insights from this plot.

# Details of the population pyramid distribution
print(pyramid_data, n = 38)# A tibble: 38 Γ 5
AG Sex Population PopPercentage Signal
<ord> <chr> <dbl> <dbl> <dbl>
1 0-4 Females 83.4 -1.99 -1
2 0-4 Males 87.5 2.09 1
3 10-14 Females 100. -2.4 -1
4 10-14 Males 104. 2.49 1
5 15-19 Females 104. -2.49 -1
6 15-19 Males 107. 2.56 1
7 20-24 Females 110. -2.64 -1
8 20-24 Males 115. 2.74 1
9 25-29 Females 135. -3.22 -1
10 25-29 Males 135. 3.23 1
11 30-34 Females 166. -3.95 -1
12 30-34 Males 156. 3.71 1
13 35-39 Females 166 -3.96 -1
14 35-39 Males 149. 3.56 1
15 40-44 Females 164. -3.92 -1
16 40-44 Males 147. 3.5 1
17 45-49 Females 158. -3.77 -1
18 45-49 Males 144. 3.44 1
19 50-54 Females 158. -3.78 -1
20 50-54 Males 149. 3.56 1
21 55-59 Females 150. -3.58 -1
22 55-59 Males 145. 3.46 1
23 5-9 Females 98.9 -2.36 -1
24 5-9 Males 104. 2.47 1
25 60-64 Females 150. -3.57 -1
26 60-64 Males 147. 3.52 1
27 65-69 Females 136. -3.25 -1
28 65-69 Males 130. 3.12 1
29 70-74 Females 107. -2.57 -1
30 70-74 Males 99.3 2.37 1
31 75-79 Females 72.9 -1.74 -1
32 75-79 Males 61.9 1.48 1
33 80-84 Females 45.1 -1.08 -1
34 80-84 Males 32.7 0.78 1
35 85-89 Females 27.2 -0.65 -1
36 85-89 Males 16.9 0.4 1
37 >90 Females 17.4 -0.42 -1
38 >90 Males 7.73 0.18 1
# Details of the AG distribution
AGsum2# A tibble: 19 Γ 5
AG Pop Pop_share cum_share percentile
<fct> <dbl> <dbl> <dbl> <dbl>
1 0_to_4 170930 0.0408 0.0408 4.1
2 5_to_9 202420 0.0483 0.0892 8.9
3 10_to_14 204610 0.0489 0.138 13.8
4 15_to_19 211560 0.0505 0.189 18.9
5 20_to_24 225020 0.0537 0.242 24.2
6 25_to_29 270090 0.0645 0.307 30.7
7 30_to_34 321010 0.0767 0.383 38.3
8 35_to_39 315180 0.0753 0.459 45.9
9 40_to_44 310700 0.0742 0.533 53.3
10 45_to_49 301820 0.0721 0.605 60.5
11 50_to_54 307760 0.0735 0.678 67.8
12 55_to_59 294500 0.0703 0.749 74.9
13 60_to_64 297020 0.0709 0.820 82
14 65_to_69 266580 0.0637 0.883 88.3
15 70_to_74 206760 0.0494 0.933 93.3
16 75_to_79 134810 0.0322 0.965 96.5
17 80_to_84 77750 0.0186 0.983 98.3
18 85_to_89 44050 0.0105 0.994 99.4
19 90_and_over 25150 0.00601 1 100
π Insights Plot 2
- The median age falls within 40 to 44 age group. About 25% are aged 60 and above, with 18% aged 65 and older, highlighting aging population trend.
- Children (0β14) account for only about 13.8% of the population, indicating low birth rates. Slightly more males than females are observed in this group.
- The working-age population (15β64 years), as defined by the Ministry of Manpower, comprises around 68.2% of the population. This reflects a strong labour force, though future demographic challenges may arise as this group continues to age.
- The 25-64 age group shows a near-equal gender balance.
- The gender gap widens in the older age cohorts, with majority female in the 80+ age groups. This show that females live longer than males on average, consistent with the life expectancy at birth between the different gendersfrom 2023 report by the Singapore Department of Statistics.
5.3 Association Between Age Group Distribution and Planning Area in Singapore

table_data <- table(demographic_data_recode$AG_recode, demographic_data_recode$PA)
chi_test <- chisq.test(table_data)
residuals_df <- as.data.frame(as.table(chi_test$residuals))
ggplot(residuals_df, aes(Var1, Var2, fill = Freq)) +
geom_tile() +
geom_text(aes(label = round(Freq, 2)), size = 4) +
scale_fill_gradient2(low = "steelblue", high = "red3", mid = "white", midpoint = 0) +
labs(x = "AG", y = "PA", fill = "Residuals") +
ggtitle("Association Between Age Group\nand Planning Area in Singapore 2024",
subtitle = "Meassured using Pearson's Chi-squared test") +
labs(caption = "Source: singstat.gov.sg")+
theme_ipsum(base_family = "Arial",
plot_title_size = 14,
subtitle_size = 10,
caption_size = 8,
plot_title_face = "bold",
caption_face = "italic",
grid = "Y",
axis_title_face = "bold",
axis_title_size = 11,
axis_text_size = 8) +
theme(
strip.text = element_text(face = "bold"),
axis.text.x = element_text(size=10, face="bold", angle = -45, hjust = 0),
axis.text.y = element_text(size=10),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.title = element_blank(),
legend.justification = c(0.45, 0),
legend.margin = margin(t = -20, r = 0, b = -10, l = 0, unit = "pt"))+
scale_x_discrete(expand = expansion(add = 0.5)) +
scale_y_discrete(expand = expansion(add = 0.5)) π Insights Plot 3
Central Areas like Downtown Core are skewed toward working-age group with strong over-representation of the 20β64 group (+7.46) and lesser children (β4.03) and seniors (β6.69), reflecting its role as a Central Business District. Similar patterns appear in Newton, Museum (Bras Basah, Dhoby Ghaut, Fort Canning), and Singapore River (Boat Quay, Clarke Quay, Robertson Quay).
New BTO areas like Tengah and Punggol show more children and lesser seniors, indicating young families moving into new Towns / Planning Areas.
Mature Towns such as Bedok, Ang Mo Kio, and Hougang exhibit neutral residuals, suggesting stable populations with a mix of age groups and long-term residents.
Affluent Districts such as Bukit Timah, we observe more children (+1.98) and 20-65 group (+2.99) with lesser seniors (-5.75). Proximity to βgood schoolsβ for kiasu parents likely attracts young wealthy families with its abundance of private housing. Similar trends observed in Tanglin.
6 Reference
- ggplot for categorical-data
- Describe function
- gt package
- theme for ggplot2
- Recode Values with dplyr
- Customize tick marks and labels
- National Statistical Standards Recommendations on Definition and Classification of Age
- Cencus of Population 2020
- Population Pyramid Plot
- Ageing Population
- Heatmap ggplot2
- Chi Square in r
7 Appendix
7.1 Appendix A
| Column Name | Description |
|---|---|
| PA | Planning Area |
| SZ | Subzone |
| AG | Age Group |
| Sex | Sex |
| FA | Floor Area of Residence |
| Pop | Resident Count (Population) |
| Time | Time / Period |
7.2 Appendix B
National Statistical Standards Recommendations on Definition and Classification of Age
| AG | AG_recode |
|---|---|
| 0_to_4 | Below 20 years |
| 5_to_9 | Below 20 years |
| 10_to_14 | Below 20 years |
| 15_to_19 | Below 20 years |
| 20_to_24 | 20-64 Years |
| 25_to_29 | 20-64 Years |
| 30_to_34 | 20-64 Years |
| 35_to_39 | 20-64 Years |
| 40_to_44 | 20-64 Years |
| 45_to_49 | 20-64 Years |
| 50_to_54 | 20-64 Years |
| 55_to_59 | 20-64 Years |
| 60_to_64 | 20-64 Years |
| 65_to_69 | 65 years & Over |
| 70_to_74 | 65 years & Over |
| 75_to_79 | 65 years & Over |
| 80_to_84 | 65 years & Over |
| 85_to_89 | 65 years & Over |
| 90_and_over | 65 years & Over |